ABSTRACT

Objectives: Risk assessment is an important part of 
emergency patient care. Risk assessment tools based 
on biochemical data have the advantage that 
calculation can be automated and results can be easily 
provided. However, to be used clinically, existing tools 
have to be validated by independent researchers. This 
study involved an independent external validation of 
four risk stratification systems predicting death that rely 
primarily on biochemical variables. 
Design: Prospective observational study. 
Setting: The medical admission unit at a regional 
teaching hospital in Denmark. 
Participants: Of 5894 adult (age 15 or above) acutely
admitted medical patients, 205 (3.5%) died during
admission and 46 (0.8%) died within one calendar day.
Interventions: None. 

Main outcome measures: The main outcome
measure was the ability to identify patients at an 
increased risk of dying (discriminatory power) as area 
under the receiver-operating characteristic curve 
(AUROC) and the accuracy of the predicted probability 
(calibration) using the Hosmer-Lemeshow goodness-of- 
fit test. The endpoint was all-cause mortality, defined in 
accordance with the original manuscripts. 
Results: Using the original coefficients, all four 
systems were excellent at identifying patients at 
increased risk (discriminatory power, AUROC >0.80). 
The accuracy was poor (we could assess calibration for 
two systems, which failed). After recalculation of the 
coefficients, two systems had improved discriminatory 
power and two remained unchanged. Calibration failed 
for one system in the validation cohort. 
Conclusions: Four biochemical risk stratification 
systems can risk-stratify the acutely admitted medical 
patients for mortality with excellent discriminatory 
power. We could improve the models for use in our 
setting by recalculating the risk coefficient for the 
chosen variables. 
INTRODUCTION

An important part of the routine work of front-line personnel in
emergency departments and admission units is to assess the risk of
individual patients. However, many physicians feel inadequately
trained,^ and prognostication is not a mandatory part of medical
education.^ As a consequence, automated risk stratification could
assist physicians attending to emergency patients. However, in a
recent review,^ none of the risk stratification tools for use in
emergency departments and admission units attained the highest level
of evidence. Several systems have been developed, but only a few have
been externally validated, even though this is an important part of
the development process.^

Article focus

■ Physicians staffing emergency departments and
admission units are not comfortable predicting
the risk of mortality for their patients.

■ Several systems that can do this have been
developed but not externally validated and
should thus not yet be used in clinical practice.

■ The aim of this article was to validate four existing
biochemical risk stratification systems predicting
mortality of acutely admitted patients.

Key messages

■ The four risk prediction systems based on biochemical
data are excellent at predicting mortality
of acutely admitted medical patients.

■ The precision of the predictions is low, but can
be improved by adjusting the systems to the
local environment by recalculating the scores.

Strengths and limitations of this study

■ This is the largest study to validate biochemical
based risk stratification systems in a medical
admission unit.

■ This study has good external validity and a low
risk of selection bias.

■ The study is limited by missing data, especially in
two of the four scores, and by the fact that it is a
single centre study.

Some of the existing risk stratification 
systems are based solely on vital signs and 
others on biochemical analyses. Systems 
based on vital signs require manual collection of data,
whereas systems based on biochemical analyses can be 
automated. Data can easily be extracted from the hos- 
pital computer systems and risk stratification can be per- 
formed in an automated process. 

We performed the present study with the objective of 
validating existing risk stratification systems that predict 
mortality for medical patients based solely on biochemical 
data. Four systems based on multiple (more than two) rou- 
tinely available variables (in our setting) and not restricted 
to selected groups of medical patients were included. 

METHODS

We performed an external validation of existing bio- 
chemical risk stratification systems by applying the coeffi- 
cients and ORs reported in the original papers. 
Furthermore, we validated the choice of variables in the 
original papers by recalculating the coefficients to fit 
our current patient population. 
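
How reported coefficients or ORs translate into a predicted risk can be sketched as follows. This is a minimal illustration that assumes a standard logistic model; the function names and every numeric value are hypothetical and are not the coefficients of any of the four systems.

```python
import math

def coefficient_from_or(odds_ratio):
    """A logistic regression coefficient is the natural log of the OR."""
    return math.log(odds_ratio)

def predicted_risk(intercept, coefficients, values):
    """Predicted probability of death from a logistic model.

    coefficients and values are dicts keyed by variable name.
    """
    linear_predictor = intercept + sum(
        coefficients[name] * values[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical values for illustration only (NOT from the original papers).
example_coefficients = {
    "age_per_decade": coefficient_from_or(1.5),
    "urea_mmol_per_l": coefficient_from_or(1.1),
}
patient_a = {"age_per_decade": 8.0, "urea_mmol_per_l": 20.0}
patient_b = {"age_per_decade": 4.0, "urea_mmol_per_l": 5.0}

risk_a = predicted_risk(-8.0, example_coefficients, patient_a)
risk_b = predicted_risk(-8.0, example_coefficients, patient_b)
```

Changing the intercept shifts every predicted risk but leaves the ordering of patients unchanged. Discrimination can therefore still be assessed when an intercept is not reported, whereas calibration cannot.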

Setting 

Sydvestjysk Sygehus is a 460-bed regional teaching hospital
in the western part of Denmark with a catchment
population of 220 000. All subspecialties of internal
medicine are represented. 

Patients can be admitted to the medical admission unit 
(MAU) by their general practitioner, out-of-hours emer- 
gency medical service, outpatient clinics, emergency 
department and ambulance services. Two attending phy- 
sicians, one in internal medicine and one in cardiology, 
one senior resident and two interns staff the MAU. 

Design and data 

We conducted a prospective observational cohort study 
of all patients admitted through the MAU at our hospital.
All consecutive adult patients (aged ≥15 years)
admitted from 2 October 2008 until 19 February 2009 
(first cohort) and from 23 February 2010 until 26 May 
2010 (second cohort) were included in the study. 

Upon admission, a nurse recorded the vital signs and 
registered these along with demographic information 
and the primary complaint on a form. After inclusion of 
all patients, we extracted blood test results from the hos- 
pital computer systems. No extra biochemical analyses 
were added as part of this study, and only analyses 
ordered by the admitting doctor were included. Most 
patients had the following standard biochemical panel
taken: haemoglobin, leukocytes, platelets, C reactive
protein, sodium, potassium, creatinine, urea, total
calcium, glucose and albumin. Almost all patients admitted
to the cardiology section had troponin, amylase and
total cholesterol measured as well. We included blood
tests drawn from 1 h before to 6 h after
admission. If a patient had multiple analyses of the same
biochemical variable, only the first was included. In case
of missing data on forms (or completely missing forms), 
data were extracted from an electronic copy of the 



nurse's notes or the chart. Inclusion of all patients was 
ensured by validation against the central hospital data- 
base. As we have no formalised classification system for 
primary complaints, one of the authors (MB) converted 
the primary complaint to a diagnosis according to the 
International Statistical Classification of Diseases and 
Related Health Problems, 10th Revision (ICD-10)^ and 
compiled these as admissions due to 

► Infectious disorders (ICD-10 diagnoses A and B); 

► Malignancy (ICD-10 diagnoses C and D); 

► Endocrine disorders (ICD-10 diagnoses E); 

► Circulatory disorders (ICD-10 diagnoses I); 

► Pulmonary disorders (ICD-10 diagnoses J); 

► Symptoms (ICD-10 diagnoses R); 

► Observational reasons (ICD-10 diagnoses Z); 

► Other reasons (ICD-10 diagnoses F, G, H, K, L, M, N,
O, P, Q, S, T, X and Y).
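
The grouping listed above can be expressed as a lookup on the first letter of the ICD-10 code. The sketch below is only an illustration; the names `ICD10_GROUPS` and `complaint_group` are hypothetical, and letters not mentioned in the list (for example U, V and W) are deliberately left unclassified.

```python
# Admission groups keyed by the ICD-10 chapter letters given in the list.
ICD10_GROUPS = {
    "Infectious disorders": "AB",
    "Malignancy": "CD",
    "Endocrine disorders": "E",
    "Circulatory disorders": "I",
    "Pulmonary disorders": "J",
    "Symptoms": "R",
    "Observational reasons": "Z",
    "Other reasons": "FGHKLMNOPQSTXY",
}

def complaint_group(icd10_code):
    """Return the admission group for an ICD-10 code such as 'J18.9'."""
    code = icd10_code.strip().upper()
    if not code:
        raise ValueError("empty ICD-10 code")
    first_letter = code[0]
    for group, letters in ICD10_GROUPS.items():
        if first_letter in letters:
            return group
    # The list does not say how other letters should be handled.
    raise ValueError(f"no group defined for ICD-10 letter {first_letter!r}")
```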

We analysed the performance of four different risk
stratification systems based on biochemical variables. The
system introduced by Prytherch et al required gender,
mode of admission, age, urea, sodium, potassium,
albumin, haemoglobin, white cell count and creatinine.
Froom and Shimoni^ included age, albumin, alkaline
phosphatase, aspartate aminotransferase, urea, glucose,
lactate dehydrogenase, neutrophil count proportion and
total leucocyte count. Loekito et al required haemoglobin,
haematocrit, total CO2, leucocytes, albumin, bilirubin,
creatinine and urea. We estimated haematocrit from
haemoglobin^ and total CO2 from bicarbonate.^^ The
score by Asadollahi et al^ required age, urea, haemoglobin,
leucocytes, platelets, sodium and glucose. If one or more
of the biochemical variables required for a given risk
assessment tool were missing for a patient, the patient was
excluded from the validation of that tool.
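
The complete-case rule described above can be sketched as a simple eligibility check. The variable lists follow the text of this section; the dictionary keys and the function name are hypothetical, and a value of `None` is assumed to mean that the analysis was not ordered.

```python
# Variables required by each score, as listed in the text above.
REQUIRED_VARIABLES = {
    "Prytherch": ["gender", "mode_of_admission", "age", "urea", "sodium",
                  "potassium", "albumin", "haemoglobin", "white_cell_count",
                  "creatinine"],
    "Froom": ["age", "albumin", "alkaline_phosphatase",
              "aspartate_aminotransferase", "urea", "glucose",
              "lactate_dehydrogenase", "neutrophil_proportion",
              "white_cell_count"],
    "Loekito": ["haemoglobin", "haematocrit", "total_co2", "white_cell_count",
                "albumin", "bilirubin", "creatinine", "urea"],
    "Asadollahi": ["age", "urea", "haemoglobin", "white_cell_count",
                   "platelets", "sodium", "glucose"],
}

def eligible_scores(patient):
    """Names of the scores for which the patient has every required variable."""
    return [
        score
        for score, variables in REQUIRED_VARIABLES.items()
        if all(patient.get(name) is not None for name in variables)
    ]

# Hypothetical patient with only the standard panel (values are made up).
standard_panel_patient = {
    "gender": "F", "mode_of_admission": "emergency", "age": 71,
    "haemoglobin": 7.9, "white_cell_count": 11.2, "platelets": 240,
    "sodium": 138, "potassium": 4.1, "creatinine": 95, "urea": 8.4,
    "glucose": 6.7, "albumin": 38,
}
```

With only the standard panel available, such a patient would enter the validation of the Prytherch and Asadollahi scores but not the other two.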

We defined the primary outcome as in the original
articles, that is, in-hospital mortality for Prytherch et al,
Asadollahi et al^ and Froom and Shimoni,^ and imminent
death (ie, death within one calendar day after the
blood was drawn) for Loekito et al. Data on this were
extracted from the hospital computer systems after the
inclusion was completed and all patients were either
discharged or dead.

The study was approved by the Danish Data Protection 
Agency. Approval from an Ethics Committee was not 
required according to Danish law. The study is reported 
in accordance with the STROBE statement. 

Statistics 

The sample size was dictated by another part of the
study. In brief, the sample size was calculated to develop
and validate a risk-stratification system to predict 7-day
all-cause mortality.

Table 1 Demographics of patients

                                       Total,          First cohort,   Second cohort,
                                       n=5894          n=3046          n=2848
Female                                 2950 (50.1%)    1460 (47.9%)    1490 (52.3%)
Age (years)                            65 (49-77)      66 (50-77)      64 (48-76)
Length of stay (days)                  2 (1-6)         2 (1-6)         1 (1-5)
In-hospital mortality                  205 (3.5%)      116 (3.8%)      89 (3.1%)
Imminent death                         46 (0.8%)       26 (0.9%)       20 (0.7%)
Admitted due to infectious disorder    178 (3.0%)      82 (2.7%)       96 (3.4%)
Admitted due to malignant disorder     128 (2.2%)      50 (1.6%)       78 (2.8%)
Admitted due to endocrine disorder     307 (5.2%)      147 (4.8%)      160 (5.6%)
Admitted due to circulatory disorder   1375 (23.4%)    527 (17.3%)     848 (29.9%)
Admitted due to pulmonary disorder     972 (16.5%)     547 (18.0%)     425 (15.0%)
Admitted due to symptoms               1194 (20.3%)    719 (23.6%)     475 (16.7%)
Admitted due to observation            1012 (17.2%)    585 (19.2%)     427 (15.1%)
Admitted due to other reasons          718 (12.2%)     389 (12.8%)     329 (11.6%)

We calculated the predicted mortality using the coefficients
presented in the original papers. To assess the
ability of each system to identify patients at highest risk
of dying (ie, the discriminatory power), we calculated
the area under the receiver-operating characteristic
curve (AUROC). AUROC is a summary measure of
sensitivity and specificity at each possible cut-off and
basically represents the probability that a patient who
eventually dies will have a higher score than a patient
who survives. An AUROC above 0.8 is said to represent
excellent discriminatory power. The calibration was
assessed using the Hosmer-Lemeshow goodness-of-fit
test. Calibration describes whether the observed mortality
rate matches the expected rate derived from the scoring
systems. For this test, we divided the population into
deciles by expected event rate. A p value above 0.05
indicates acceptable calibration. A scoring system might
show excellent discriminatory power and yet have poor
calibration if, for example, it was developed on a population
with low overall mortality and then applied to a
population with high overall mortality.
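
The two measures described above can be sketched in a few lines. This is a minimal illustration and not the Stata routines used for the analyses; the function names are hypothetical, and groups are formed here by simply cutting the sorted predictions into equally sized parts.

```python
def auroc(scores, died):
    """Probability that a random non-survivor has a higher score than a
    random survivor (ties count one half)."""
    dead = [s for s, d in zip(scores, died) if d]
    alive = [s for s, d in zip(scores, died) if not d]
    if not dead or not alive:
        raise ValueError("need at least one death and one survivor")
    wins = 0.0
    for x in dead:
        for y in alive:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(dead) * len(alive))

def hosmer_lemeshow_statistic(predicted, died, groups=10):
    """Hosmer-Lemeshow chi-square over groups of increasing predicted risk."""
    pairs = sorted(zip(predicted, died))
    n = len(pairs)
    statistic = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected = sum(p for p, _ in chunk)        # expected deaths
        observed = sum(1 for _, d in chunk if d)   # observed deaths
        mean_risk = expected / size
        if 0.0 < mean_risk < 1.0:
            statistic += (observed - expected) ** 2 / (
                size * mean_risk * (1.0 - mean_risk)
            )
    return statistic
```

Because the AUROC depends only on the ordering of the scores, halving every predicted risk leaves it unchanged while the Hosmer-Lemeshow statistic changes. This is how a system can discriminate well and still be poorly calibrated.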



As the predictive power would be expected to vary
across populations, we calculated the AUROC of each of
the original scores for each of the previously
specified groups of presenting complaints.

Finally, we attempted to optimise the models to our 
setting by recalculating the scoring coefficients; that is, 
we performed the multivariable analyses anew by using 
the variables included in the original models. We used 
the first cohort (collected from 2008 to 2009) for the 
development and the second cohort (collected in 2010) 
for validation of the recalculated coefficients. 
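
The split into development and validation cohorts follows the inclusion periods given under Design and data and can be sketched as below. The function name is hypothetical, and treating both end dates as inclusive is an assumption.

```python
from datetime import date

# Inclusion periods as stated in the text; end dates assumed inclusive.
FIRST_COHORT = (date(2008, 10, 2), date(2009, 2, 19))    # development
SECOND_COHORT = (date(2010, 2, 23), date(2010, 5, 26))   # validation

def cohort_for(admission_date):
    """Return 'development', 'validation' or None for an admission date."""
    if FIRST_COHORT[0] <= admission_date <= FIRST_COHORT[1]:
        return "development"
    if SECOND_COHORT[0] <= admission_date <= SECOND_COHORT[1]:
        return "validation"
    return None
```

Coefficients would be estimated only on patients labelled `development` and then applied unchanged to those labelled `validation`.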

Table 2 Variables included in the scores and the level of missing data

Variable                      Missing (%)  Prytherch  Froom  Loekito  Asadollahi
                                           score      score  score    score
Lactate dehydrogenase         76.6                    •
Bilirubin                     75.1                           •
Alkaline phosphatase          75.0                    •
Bicarbonate                   71.6                           •
Alanine aminotransferase      68.3                    •
Neutrophil count proportion   42.1                    •
Urea/creatinine               13.0         •
Urea                          12.7         •          •      •        •
Albumin                        7.5         •          •      •
Platelets                      7.1                                    •
Glucose                        6.9                    •               •
White cell count               6.0         •          •      •        •
Creatinine                     5.8         •                 •
Potassium                      5.5         •
Sodium                         5.2         •                          •
Haemoglobin                    5.1         •                 •        •
Haematocrit                    5.1                           •
Age                            0.0         •          •      •        •
Gender                         0.0         •
Mode of admission              0.0         •

• Required in the score.

Figure 1 Discriminatory power of four risk stratification systems based on biochemical variables. Original coefficients were used
to generate receiver-operating curves.

As the Asadollahi score^^ is a set score (ranging from 0
to 20) and not a regression formula, we initially performed
a new logistic regression using our development cohort.
From the coefficients derived, we assigned a score (from 1
to 6) to each variable and recalculated the score for both
cohorts. We tested calibration according to Seymour
et al;^ that is, we predicted the probabilities of the individual
scores using logistic regression analysis and calculated
the Hosmer-Lemeshow goodness-of-fit test.
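
One way to turn regression coefficients into integer points between 1 and 6 is proportional scaling against the largest coefficient. The text does not state which rule was actually applied, so the sketch below is only an assumption made for illustration; the function names and the coefficients are hypothetical.

```python
def points_from_coefficients(coefficients, max_points=6):
    """Scale absolute coefficients so the largest gets max_points.

    Assumed rule (not stated in the text): round to the nearest integer
    and give every variable at least one point.
    """
    largest = max(abs(b) for b in coefficients.values())
    if largest == 0:
        raise ValueError("all coefficients are zero")
    points = {}
    for name, b in coefficients.items():
        scaled = abs(b) / largest * max_points
        points[name] = max(1, int(scaled + 0.5))
    return points

def total_score(points, abnormal_variables):
    """Sum the points of the variables flagged as abnormal for a patient."""
    return sum(points[name] for name in abnormal_variables)

# Hypothetical coefficients, NOT those estimated in this study.
example_points = points_from_coefficients(
    {"urea": 1.2, "age": 0.8, "sodium": 0.6, "glucose": 0.05}
)
```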

Data are reported as median (IQR) or proportions
whenever appropriate. Differences between patients with
and without missing data were tested using the χ² test or
Wilcoxon rank-sum test.

Stata V.12.1 (StataCorp, College Station, Texas, USA)
was used for the analyses. 



RESULTS 

A total of 5894 patients were included in our study (see 
table 1 for details). Among these, 205 (3.5%) died during 
the admission, and 46 (0.8%) died within one calendar day. 



Validation of the original scores 

We could include 4925 patients (83.6% of the entire
cohort) in the Prytherch score (table 2). Using the original
formula, we found an AUROC of 0.842 (95% CI
0.818 to 0.865; figure 1 and table 3) and a goodness-of-fit
test of χ²=119.63 (10 degrees of freedom), p<0.001. Thus,
the Prytherch score showed a good ability to identify
patients at high risk of dying, but failed in calibration, as
fewer patients died than expected.

Table 3 Performance of the model using the original coefficients and after recalculation

Discriminatory power (AUROC)
Score              Original model        Recalculated model,   Recalculated model,
                                         development           validation
Prytherch score    0.842 (0.818-0.865)   0.858 (0.827-0.889)   0.874 (0.841-0.907)
Froom score        0.862 (0.813-0.910)   0.930 (0.897-0.962)   0.882 (0.806-0.957)
Loekito score      0.922 (0.879-0.965)   0.911 (0.819-1.000)   0.917 (0.823-1.000)
Asadollahi score   0.803 (0.776-0.829)   0.808 (0.774-0.842)   0.813 (0.772-0.854)

Calibration (Hosmer-Lemeshow test p value)
Score              Original model        Recalculated model,   Recalculated model,
                                         development           validation
Prytherch score    <0.001                0.59                  0.66
Froom score        -                     0.93                  0.009
Loekito score      0.0007                0.79                  1.00
Asadollahi score   -                     0.79                  0.47

Area under receiver-operating curve (AUROC) above 0.8 represents good discriminatory power, and p value for calibration above 0.05
represents good calibration.

In calculating the Froom score,^ we could include
only 919 patients (15.6%; table 2). Using the ORs specified
in the original article, we found an AUROC of
0.862 (95% CI 0.813 to 0.910; figure 1 and table 3). As
the original paper did not provide the coefficient for
the intercept, we were unable to reliably assess calibration.
In an attempt to reduce selection bias, Froom and
Shimoni^ used imputation of the mean (by assigning the
value of 2.5 to all missing variables reduced into quartiles).
Adopting this approach led to the inclusion of all
5894 patients with an AUROC of 0.814 (95% CI 0.788 to
0.841). Again, because of a missing coefficient for the
intercept, we could not assess calibration. Thus, the
Froom score was good at identifying patients at high
risk, but we could not assess the level of precision.
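
The imputation rule mentioned above can be sketched as follows, assuming that each variable is coded 1 to 4 by quartile, which would make 2.5 the mean of the four codes. The function name and the cut points are hypothetical and are not taken from the original article.

```python
def quartile_code(value, cut_points):
    """Code a measurement as quartile 1-4; a missing value gets 2.5.

    cut_points are the three ascending boundaries between quartiles.
    """
    if value is None:
        return 2.5  # mean of the codes 1, 2, 3 and 4
    code = 1
    for cut in cut_points:
        if value > cut:
            code += 1
    return code

# Hypothetical quartile boundaries for albumin in g/L (illustration only).
ALBUMIN_CUTS = (32.0, 37.0, 41.0)
```

Every patient then receives a value for every variable, which is why all 5894 patients could be included with this approach.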

As for the Loekito score,^ we could include 540
patients (9.2%; table 2). Using the reported coefficients,
we found an excellent discriminatory power
(AUROC=0.922; 95% CI 0.879 to 0.965; figure 1 and table 3).
Calibration failed with a goodness-of-fit test of χ²=30.7,
p=0.0007. Thus, the Loekito score showed excellent
discriminatory power but failed calibration.
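
The reported p value can be checked against the test statistic with the closed-form upper tail of the chi-square distribution, which exists for even degrees of freedom. The sketch assumes 10 degrees of freedom, as stated for the Prytherch score; the function name is hypothetical.

```python
import math

def chi_square_p_value(statistic, degrees_of_freedom):
    """Upper-tail p value of a chi-square statistic (even df only)."""
    if degrees_of_freedom <= 0 or degrees_of_freedom % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    # Sum of half**i / i! for i = 0 .. df/2 - 1
    for i in range(1, degrees_of_freedom // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Statistic reported for the Loekito score; 10 df is an assumption.
p_loekito = chi_square_p_value(30.7, 10)
```

Under that assumption the result rounds to the reported p value of 0.0007.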

We could include 4863 patients (82.5%) in the Asadollahi
score (table 2). We found a good discriminatory power
(AUROC=0.803; 95% CI 0.776 to 0.829; figure 1 and table 3),
but could not assess calibration because of the construction of
the score in the original article.

The predictive ability of each score varied widely with
the presenting complaint; however, within each complaint,
the scores had more or less identical AUROCs
(table 4). Overall, malignant, endocrine and pulmonary
disorders had the lowest AUROC, while infectious
disorders had the highest (table 4). Some of these calculations
are based on limited numbers (as indicated by
the CIs).

Recalculated coefficients 

Performing the recalculation of the Prytherch score,^ we
achieved excellent AUROCs in both cohorts as well as
acceptable calibration (figure 2 and table 3). Sex, urea,
sodium, haemoglobin, creatinine and potassium were not
significantly associated with in-hospital mortality in our
material, but because they were included in the original score,
we kept them in the analysis.

Recalculating the Froom score, we achieved excellent 
AUROCs in both cohorts, but calibration failed in the 
validation cohort (figure 2 and table 3). Age, alkaline 
phosphatase, alanine aminotransferase, urea, white cell 
count and glucose were not significantly associated with 
in-hospital mortality, but were kept in the model. 

When recalculating the Loekito score,^ we found that
urea, creatinine, albumin, haemoglobin and white cell
count were not significantly associated with the endpoint
of 1-day mortality. We achieved excellent AUROCs
in both cohorts as well as almost perfect calibration
(figure 2 and table 3).

When recalculating the Asadollahi score, we 
assigned a score of one each to haemoglobin, platelets 
and glucose (none of which were significantly associated 
with the endpoint), three to sodium, four each to age 
and white cell count and six to urea. AUROC was excellent
in both cohorts and calibration acceptable (figure 2
and table 3).

In all four methods, the discriminatory power remained 
constant or improved when we compared it with the calcu- 
lation based on the original coefficients and ORs. 
Figure 2 Discriminatory power after recalculation of new coefficients to match our setting.



Selection bias 

For the Prytherch, Froom and Asadollahi scores, patients 
who were excluded because of missing values had the 
same mortality as those who were included (table 5). For 
the Loekito score, patients with missing data had signifi- 
cantly lower 1-day mortality (table 5). 
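
A comparison of mortality between included and excluded patients, as summarised above, can be sketched with a Pearson χ² test on a 2x2 table. The function name and the counts in the usage note are hypothetical, and omitting a continuity correction is an assumption.

```python
import math

def chi_square_2x2(dead_a, alive_a, dead_b, alive_b):
    """Pearson chi-square (no continuity correction) and p value, 1 df."""
    n = dead_a + alive_a + dead_b + alive_b
    row_a, row_b = dead_a + alive_a, dead_b + alive_b
    col_dead, col_alive = dead_a + dead_b, alive_a + alive_b
    if 0 in (row_a, row_b, col_dead, col_alive):
        raise ValueError("every row and column needs at least one patient")
    statistic = n * (dead_a * alive_b - alive_a * dead_b) ** 2 / (
        row_a * row_b * col_dead * col_alive
    )
    # For 1 df the upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value
```

For two made-up groups with identical mortality, such as `chi_square_2x2(10, 90, 20, 180)`, the statistic is 0 and the p value is 1.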

DISCUSSION 

Using four existing biochemical-based risk stratification 
systems, we could risk-stratify acutely admitted medical 
patients with excellent discriminatory power. We could 
only evaluate the calibration for two scores, the 
Prytherch score^ and the Loekito score,^ which both
failed. When recalculating all four scores, both discriminatory
power and calibration improved, except for the
Froom score,^ where calibration failed.

In the present article, we focused only on biochemical- 
based risk stratification systems. While systems based on 
vital signs can be calculated shortly after arrival, 
biochemical-based systems require the blood tests to be 
analysed first. On the other hand, for systems based only 
on biochemical data, interobserver or intraobserver vari- 
ation is virtually eliminated. We have identified four 
systems with broad inclusion criteria that could poten- 
tially be used in emergency departments and MAUs. 
The systems included were developed in different settings,
ranging from floor beds^ ^ to a medical emergency
room.^ One was internally validated using a split
sample technique,^ while the others were validated in
external cohorts.^ ^ However, even if the systems were
developed in a setting similar to ours and validated by 
the original authors, they still need to be externally 



validated in independent cohorts, as we now have per- 
formed, before they should be used in the clinical 
routine.^ 

Although all four systems had acceptable discrimin- 
atory power, two systems failed in calibration. One way of 
correcting poor calibration is to perform a recalculation. 
We did so by performing a multivariable
logistic regression in one cohort and then validating 
it in another. This approach generally improved the 
discriminatory power and made calibration acceptable. 
In fact, calibration became acceptable in both systems 
that previously failed. After recalculation, however, 
calibration failed in the Froom score, a system for 
which we could not test calibration using the original 
formula. Our best explanation for this is differences in 
mortality because the Froom score^ was developed and 
validated in cohorts with a higher mortality than ours 
(5.6% vs 3.5%). 

The Prytherch score^ seems to fit our setting best. The 
discriminatory power was excellent both before and 
after recalculation. Calibration failed before recalculation,
but was acceptable afterwards. Most importantly,
using our standard biochemical profile, we could
include the majority of our patients. Both the Froom^
and Loekito scores^ performed better, but only marginally;
the Froom score^ failed on calibration after
recalculation, and we could include only a few of our
patients in either score. However, the choice of score
depends on several additional factors. Some hospitals 
might not routinely measure all investigations required 
by each score (eg, albumin) and some investigations are 
error prone (eg, haemolysis in potassium measurements).
The Asadollahi score relies on only seven
parameters and could thus be easy to obtain and
perhaps less expensive to report for most patients. Also,
it is not significantly inferior to the other scores and
might therefore be more suitable for other settings.

Our study has limitations. First, we have a substantial 
amount of missing data. This absence is not a major 
problem when calculating the Prytherch^ or Asadollahi
score, but it was for the Froom^ and Loekito scores.^
There is no doubt that this has introduced selection bias
into our study. Although we have not been able to
demonstrate any selection bias for the Prytherch, Froom
and Asadollahi scores looking at our primary endpoint
of mortality,^ ^ we showed that patients with missing
data in the Loekito score^ had a significantly lower mortality.
An apparent explanation is that bicarbonate is
part of the formula. At our institution, bicarbonate is 
mostly analysed as part of arterial blood gas analyses and 
thus primarily measured in the most critically ill 
patients. Patients with missing data also had a signifi- 
cantly shorter length of stay, but were not uniformly 
older or younger than patients that could be included in 
each score (table 5). These indications of selection 
biases prompt us to question the external validity and 
generalisability of our findings, and we see this as an 
indication that further studies, where the risk of selec- 
tion bias is minimised, are required. Second, the 
Loekito score^ requires haematocrit (we estimated this 
using the haemoglobin level^) and total CO2 (which we 
estimated using bicarbonate).^^ However, when perform- 
ing our own logistic regression analyses of both systems, 
we had acceptable results, proving this to be of no 
concern. Third, this study still represents a single centre 
application of the scoring systems, and the results 
should be evaluated with this in mind. Fourth, we run a 
risk of overfitting^ when performing recalculation.
With only 26 imminent fatalities in the development
cohort, overfitting is a potential risk for the Loekito
score.^ However, our validation proves that it was not an
issue. As for the other three systems, we have enough 
fatalities for a valid recalculation. 

We have found that four risk stratification systems 
based on biochemical data can identify patients at an 
increased risk of dying, although with limited precision. 
The models could be improved by recalculation, but the 
question remains if the use of these systems will improve 
clinical practice. In an ideal study, patients should be 
randomised to either be risk-stratified by a predefined 
system or be managed by clinical assessment alone, and 
the potential improvement in treatment should be mea- 
sured. This approach is a complicated setup not previ- 
ously performed for any of the present systems, but is 
the only way to show if the implementation of the system 
matters. 